Statistical Databases
Author
Abstract
Introduction

Statistical databases are databases containing statistical information. Such databases are normally released by national statistical institutes but, on occasion, they are also released by healthcare authorities (epidemiology) or by private organizations (e.g. consumer surveys). Statistical databases typically come in three formats:

• Tabular data, that is, tables with counts or magnitudes, which are the classical output of official statistics;
• Queryable databases, that is, on-line databases to which the user can submit statistical queries (sums, averages, etc.);
• Microdata, that is, files where each record contains information on an individual (a citizen or a company).

The peculiarity of statistical databases is that they should provide useful statistical information without revealing private information on the individuals they refer to (respondents). Indeed, supplying data to national statistical institutes is compulsory in most countries but, in return, those institutes commit to preserving the privacy of respondents.

Inference control in statistical databases, also known as Statistical Disclosure Control (SDC), is a discipline that seeks to protect data in statistical databases so that they can be published without revealing confidential information that can be linked to the specific individuals to whom the data correspond. SDC is applied to protect respondent privacy in areas such as official statistics, health statistics and e-commerce (sharing of consumer data). Since data protection ultimately means data modification, the challenge for SDC is to achieve protection with minimum loss of the accuracy sought by database users.

In [1], a distinction is made between SDC and other technologies for database privacy, such as privacy-preserving data mining (PPDM) and private information retrieval (PIR): what distinguishes those technologies is whose privacy they seek to protect. While SDC is aimed at respondent privacy, the primary goal of PPDM is to protect owner privacy when several database owners wish to cooperate in joint analyses across their databases without giving away their original data to each other. The primary goal of PIR, in turn, is user privacy, that is, to allow the user of a database to retrieve some information item without the database learning which item was retrieved.

The literature on SDC started in the 1970s, with the seminal contribution by Dalenius [2] in the statistical community and the works by Schlörer and Denning [3, 4] in the database community. The 1980s saw moderate activity in this field. An excellent survey of the state of the art …
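The three release formats named in the introduction, and the reason they need protection, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the records, the field names and the helper functions `count_table` and `query_sum` are hypothetical and do not come from the paper. It shows how two harmless-looking aggregate queries can be combined to recover one respondent's value, which is the kind of inference SDC tries to prevent.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent (all values invented).
MICRODATA = [
    {"id": 1, "region": "North", "sector": "Retail", "income": 30_000},
    {"id": 2, "region": "North", "sector": "Retail", "income": 34_000},
    {"id": 3, "region": "North", "sector": "Mining", "income": 90_000},
    {"id": 4, "region": "South", "sector": "Retail", "income": 28_000},
    {"id": 5, "region": "South", "sector": "Mining", "income": 61_000},
    {"id": 6, "region": "South", "sector": "Mining", "income": 59_000},
]


def count_table(records):
    """Tabular data: respondent counts per (region, sector) cell."""
    return Counter((r["region"], r["sector"]) for r in records)


def query_sum(records, predicate, field="income"):
    """Queryable database: sum of one field over the matching records."""
    return sum(r[field] for r in records if predicate(r))


# Format 1: a table of counts derived from the microdata.
table = count_table(MICRODATA)

# Format 2: two statistical queries, each covering more than one respondent.
north_total = query_sum(MICRODATA, lambda r: r["region"] == "North")
north_retail = query_sum(
    MICRODATA,
    lambda r: r["region"] == "North" and r["sector"] == "Retail",
)

# The count table shows a single respondent in (North, Mining), so the
# difference of the two sums is exactly that respondent's income.
leaked_income = north_total - north_retail

print(table[("North", "Mining")], leaked_income)
```

In this toy setting, no individual record is ever released, yet the unprotected table and query answers together disclose a single respondent's income. An SDC method would have to modify or restrict these outputs, at some cost in accuracy.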
Similar resources
Metadata Management for Large Statistical Databases
Data description or metadata presents a significant database management challenge, particularly for scientific and statistical databases. Ideally, we would like to access and manipulate data and metadata using the same DBMS tools, but there are few systems that even begin to provide such integrated capabilities. This paper outlines a framework for more integrated metadata management by synthesi...
A Model for Representing Statistical Objects
In this paper the structure and the semantic properties of the entities stored in databases whose data are only aggregate-type data are defined and discussed. This choice is justified by the widespread use of aggregate data without the corresponding raw data (i.e. micro-data, such as census data). Aggregate data are often derived by applying statistical aggregation (e.g. sum, count) and stat...
On the Security of Noise Addition for Privacy in Statistical Databases
Noise addition is a family of methods used in the protection of the privacy of individual data (microdata) in statistical databases. This paper is a critical analysis of the security of the methods in that family.
Advances in Inference Control in Statistical Databases: An Overview
Inference control in statistical databases is a discipline with several other names, such as statistical disclosure control, statistical disclosure limitation, or statistical database protection. Regardless of the name used, current work in this very active field is rooted in the work that was started on statistical database protection in the 70s and 80s. Massive production of computerized statis...
Statistical Computing and Databases: Distributed Computing Near the Data
This paper addresses the following question: “how do we fit statistical models efficiently with very large data sets that reside in databases?” Nowadays it is quite common to encounter a situation where a very large data set is stored in a database, yet the statistical analysis is performed with a separate piece of software such as R. Usually it does not make much sense and in some cases it ...